26.1 Antalya port - Alternative syntax for cluster functions #1390
ianton-ru wants to merge 27 commits into antalya-26.1 from
Conversation
…ocal_cluster icebergLocalCluster table function
…ge_blocks Fix prefersLargeBlocks for cluster storage
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f5b1756a56
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```cpp
auto arguments = make_intrusive<ASTExpressionList>();

arguments->children.push_back(make_intrusive<ASTLiteral>(connection_params.endpoint.storage_account_url));
arguments->children.push_back(make_intrusive<ASTIdentifier>(connection_params.endpoint.container_name));
```
Emit Azure container argument as a string literal
When StorageObjectStorageCluster rewrites Azure table-engine reads into table-function calls (for object_storage_cluster/cluster-mode execution), this argument builder emits the container as ASTIdentifier instead of ASTLiteral. Azure table-function parsing expects a literal container name, so distributed reads can fail with argument/identifier errors on remote nodes. This should serialize container_name as a literal string to keep rewritten queries valid.
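The distinction can be sketched outside ClickHouse. When a rewritten query is serialized back to SQL text for remote nodes, a literal keeps its quoting while an identifier is emitted as a bare token; the helpers below are hypothetical, language-neutral stand-ins, not ClickHouse APIs.

```python
# Toy sketch (hypothetical helpers, not ClickHouse code): whether a string
# argument survives the round trip depends on how it is serialized.
def serialize_literal(value: str) -> str:
    # A literal is re-emitted quoted, so the remote parser sees a string.
    return "'" + value.replace("'", "\\'") + "'"

def serialize_identifier(value: str) -> str:
    # An identifier is emitted bare; the remote parser treats it as a name,
    # which is invalid where the table function expects a container string.
    return value

container = "my-container"
literal_form = serialize_literal(container)
identifier_form = serialize_identifier(container)
```

This is why serializing `container_name` as `ASTLiteral` keeps the rewritten query parseable on worker nodes.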
src/Storages/IStorageCluster.cpp (Outdated)
```diff
  updateBeforeRead(context);
- auto cluster = getCluster(context);
+ auto cluster = getClusterImpl(context, cluster_name_from_settings, context->getSettingsRef()[Setting::object_storage_max_nodes]);
```
Limit host capping to object-storage cluster engines only
This unconditionally applies object_storage_max_nodes in base IStorageCluster::read, so non-object-storage engines that also inherit IStorageCluster (notably StorageURLCluster and StorageFileCluster) will also have shards silently dropped when the setting is non-zero. That can return partial query results for urlCluster/fileCluster even though the setting is documented for object-storage cluster functions.
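One way to scope the setting, sketched with hypothetical names: apply the cap only when the engine reports itself as object-storage-backed, so urlCluster/fileCluster keep their full shard list.

```python
# Hypothetical sketch of gating a max-hosts cap by engine kind.
def effective_hosts(hosts, is_object_storage: bool, max_nodes: int):
    # Cap only object-storage cluster engines; 0 means "no limit".
    if is_object_storage and max_nodes > 0:
        return hosts[:max_nodes]
    return hosts

hosts = ["node1", "node2", "node3"]
capped = effective_hosts(hosts, True, 2)    # object-storage read: capped
full = effective_hosts(hosts, False, 2)     # urlCluster-style read: untouched
```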
```cpp
auto p = function_to_cluster_function.find(table_function->name);
if (p == function_to_cluster_function.end())
{
    throw Exception(
        ErrorCodes::LOGICAL_ERROR,
```
Handle deltaLakeLocal in cluster-function rewrite
In cluster mode for table engines, DeltaLakeLocal is rewritten to table function deltaLakeLocal, but this conversion path only maps functions that have *Cluster equivalents and then throws LOGICAL_ERROR when none is found. As a result, reads of ENGINE=DeltaLakeLocal with non-empty object_storage_cluster fail with an internal error instead of a supported path or a proper user-facing validation error.
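The shape of the fix can be sketched generically (names below are illustrative, not the PR's actual map contents): a lookup miss on this path should surface a user-facing validation error rather than LOGICAL_ERROR.

```python
# Hypothetical sketch: map engines to cluster functions, but fail with a
# user-facing message (not an internal error) for unmapped engines.
FUNCTION_TO_CLUSTER_FUNCTION = {
    "deltaLake": "deltaLakeCluster",
    "iceberg": "icebergCluster",
}

def to_cluster_function(name: str) -> str:
    try:
        return FUNCTION_TO_CLUSTER_FUNCTION[name]
    except KeyError:
        # A proper validation error instead of a LOGICAL_ERROR crash path.
        raise ValueError(
            f"Table function '{name}' does not support "
            "object_storage_cluster execution")
```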
@codex review
💡 Codex Review
Reviewed commit: 6048e5d947
```cpp
#if USE_AVRO
using TableFunctionIcebergCluster = TableFunctionObjectStorageCluster<IcebergClusterDefinition, StorageIcebergConfiguration, true>;
using TableFunctionIcebergLocalCluster = TableFunctionObjectStorageCluster<IcebergLocalClusterDefinition, StorageLocalIcebergConfiguration, true>;
```
Remove duplicate IcebergLocal cluster type alias
This header now declares TableFunctionIcebergLocalCluster twice under the same #if USE_AVRO guard, which creates a redefinition in AVRO-enabled builds and breaks compilation of the table-function targets. Keeping a single alias declaration avoids a hard build failure for Iceberg-capable configurations.
```cpp
auto p = engine_to_function.find(storage_engine_name);
if (p == engine_to_function.end())
{
```
Add S3-alias mappings before throwing on engine rewrite
This lookup throws when storage_engine_name is not in engine_to_function, but the rewrite path is also used for GCS, COSN, and OSS engines (their storage engine names are defined separately in StorageObjectStorageDefinitions.h). With object_storage_cluster enabled, queries against those engines now fail with LOGICAL_ERROR instead of executing, so these aliases need explicit mapping or a fallback in this conversion step.
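The alias fix can be sketched as follows (a toy map, assuming the S3-compatible engines can share the s3Cluster rewrite; not the PR's actual table):

```python
# Hypothetical sketch: S3-compatible engine aliases resolved to the same
# cluster function instead of falling through to an internal error.
ENGINE_TO_FUNCTION = {"S3": "s3Cluster"}

# GCS, COSN, and OSS are S3-compatible engine names defined separately,
# so they need explicit entries pointing at the same rewrite target:
for alias in ("GCS", "COSN", "OSS"):
    ENGINE_TO_FUNCTION[alias] = "s3Cluster"
```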
```diff
  {
      chassert(result.count == 0); /// We shouldn't use replacement with masking other arguments
-     result.start = url_arg_idx;
+     result.start = function->arguments->getRealIndex(url_arg_idx);
```
Align SAS masking index with skipped-argument offsets
After introducing skipped arguments (used for iceberg(..., storage_type=...)), Azure secret masking must always map back to original argument positions; this branch was updated for AccountKey, but the SharedAccessSignature branch below still writes a relative index. In queries like iceberg(storage_type='azure', '...SharedAccessSignature=...'), that mismatch masks the wrong argument and can leave the SAS token unredacted.
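The index mismatch is easy to state in isolation. A minimal sketch, with a hypothetical `real_index` helper standing in for the offset mapping: a position in the filtered argument list must be shifted past every skipped original position before it is used for masking.

```python
# Hypothetical sketch: map an index in the filtered argument list back to
# the original argument position, accounting for skipped arguments
# (e.g. storage_type=... in iceberg(storage_type='azure', url, sas)).
def real_index(relative_idx: int, skipped_positions) -> int:
    idx = relative_idx
    for pos in sorted(skipped_positions):
        if pos <= idx:
            idx += 1  # shift past each skipped original slot
    return idx

# storage_type sits at original position 0 and is skipped, so the SAS
# argument at filtered position 1 is really original position 2.
sas_real_pos = real_index(1, {0})
```

Masking the relative index instead of `sas_real_pos` redacts the wrong argument, which is the regression the comment describes.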
@codex review
💡 Codex Review
Reviewed commit: f3c7d36056
```cpp
auto args_copy = args;
// Remove cluster name from args to avoid confusing cluster name and named collection name
args_copy.erase(args_copy.begin());
configuration->extractDynamicStorageType(args_copy, context, &object_storage_type_arg);
```
Extract dynamic storage_type from live args
For old *Cluster syntax (for example icebergCluster(..., storage_type='azure')), this code extracts storage_type from args_copy instead of args, so the original argument remains in args and then gets appended again via object_storage_type_arg later in updateQueryToSendIfNeeded. The forwarded query can end up with two storage_type key-value arguments, and worker-side parsing rejects it with the duplicate-parameter error, which breaks distributed reads that use explicit storage_type.
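The duplication mechanism can be shown with plain lists (a toy sketch with a hypothetical `extract_kv` helper, not the actual configuration API): extracting the key-value argument from a copy leaves it in the live argument list, so re-appending the extracted value yields two `storage_type` arguments.

```python
# Hypothetical sketch of the duplicate-argument bug.
def extract_kv(args, key):
    # Removes "key=..." from args in place and returns its value, or None.
    for i, a in enumerate(args):
        if a.startswith(key + "="):
            return args.pop(i).split("=", 1)[1]
    return None

args = ["cluster_1", "s3://bucket/path", "storage_type=azure"]
args_copy = args[1:]                      # cluster name stripped, but a copy
extract_kv(args_copy, "storage_type")     # removed from the copy only

# The live args still carry storage_type, so appending the extracted value
# later produces a duplicate key-value pair in the forwarded query:
forwarded = args[1:] + ["storage_type=azure"]
```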
```python
assert int(hosts_engine_distributed) == 3


def test_distributed_s3_table_engine(started_cluster):
```
Remove duplicate test_distributed_s3_table_engine
This module defines test_distributed_s3_table_engine twice; in Python the later definition overwrites the earlier one at import time, so one full test body is never executed. That silently drops intended coverage and can hide regressions in this path while CI still reports green.
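A minimal illustration of the shadowing behavior: a later `def` with the same name rebinds it when the module is imported, so the first test body is never collected or executed.

```python
# Minimal illustration: the second definition silently replaces the first
# at import time, so only one of the two bodies ever runs.
def test_distributed_s3_table_engine():
    return "first body"

def test_distributed_s3_table_engine():  # shadows the definition above
    return "second body"
```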
@codex review
💡 Codex Review
Reviewed commit: 8ffa292f43
```cpp
#endif

#if USE_AVRO
using TableFunctionIcebergCluster = TableFunctionObjectStorageCluster<IcebergClusterDefinition, StorageIcebergConfiguration, true>;
```
Route icebergCluster masking through storage_type-aware logic
TableFunctionIcebergCluster now uses StorageIcebergConfiguration, so icebergCluster(..., storage_type='azure', ...) can carry Azure secrets, but FunctionSecretArgumentsFinder::findOrdinaryFunctionSecretArguments still classifies icebergCluster under the S3 branch (findS3FunctionSecretArguments(true)) rather than the new Iceberg storage-type-aware masking path. In cluster queries this can leave account_key/connection-string secrets visible in query text and system.query_log (while masking the wrong positional argument), so the cluster Iceberg variant needs the same storage-type dispatch that was added for iceberg.
@codex review
💡 Codex Review
Reviewed commit: df401f6c5e
```cpp
UInt32 shard_num = 0;
for (auto & shard_info : shards_info)
{
    addresses_with_failover_.push_back(addresses_with_failover[shard_info.shard_num - 1]);
```
Preserve address mapping when capping cluster host count
constrainShardInfoAndAddressesToMaxHosts indexes addresses_with_failover with shard_info.shard_num - 1, but in replicas-as-shards mode shard_num comes from the original shard id (not the selected replica position) and can repeat across replicas. When object_storage_max_nodes is used on a cluster that has replicas, this remaps selected shards to wrong/duplicate addresses and can desynchronize shards_info from addresses_with_failover, so distributed object-storage reads are sent to unintended hosts and lose expected parallelism.
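The scenario the reviewer describes can be simulated with plain lists (toy data, assuming repeated `shard_num` values across selected replicas): indexing addresses by `shard_num - 1` then repeats one address and drops another.

```python
# Toy simulation of the reviewer's concern: selected replicas that share an
# original shard_num map to the same address entry, desynchronizing the
# shards_info list from addresses_with_failover.
shards_info = [{"shard_num": 1}, {"shard_num": 1}, {"shard_num": 2}]
addresses_with_failover = [["hostA"], ["hostB"], ["hostC"]]

capped_addresses = [
    addresses_with_failover[s["shard_num"] - 1] for s in shards_info
]
# hostA appears twice and hostC is never selected, even though three
# distinct entries were available.
```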
constrainShardInfoAndAddressesToMaxHosts is called from the Cluster constructor with ReplicasAsShardsTag, where each node is placed in a separate shard, so no shard has more than one replica.
```cpp
auto storage_type = findIcebergStorageType(named_collection_pos);

if (storage_type == "s3")
    findS3FunctionSecretArguments(false);
```
Use cluster argument offset for icebergCluster secret masking
For icebergCluster(...), this path always calls the non-cluster masking helpers, so argument indexing starts at position 0 instead of skipping the cluster name. With positional credentials (for example S3-style access key + secret), the code masks the wrong argument and can leave the actual secret key visible in logs/query text, which is a sensitive-data regression specific to icebergCluster queries.
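The off-by-one can be stated in a few lines (a toy sketch with hypothetical positions, assuming the common S3-style layout of url, access key, then secret): cluster variants carry the cluster name as argument 0, so positional masking must start one slot later.

```python
# Hypothetical sketch: the secret's position shifts by one for cluster
# variants because the cluster name occupies the first argument slot.
def secret_arg_index(is_cluster_function: bool) -> int:
    base = 2  # e.g. iceberg(url, access_key_id, secret) -> secret at index 2
    return base + (1 if is_cluster_function else 0)

plain_pos = secret_arg_index(False)    # iceberg(url, key, secret)
cluster_pos = secret_arg_index(True)   # icebergCluster(cluster, url, key, secret)
```

Calling the non-cluster helper for icebergCluster masks `plain_pos` instead of `cluster_pos`, leaving the real secret visible.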
@codex review
💡 Codex Review
Reviewed commit: 0da7fa3b9e
```cpp
arguments->children.push_back(make_intrusive<ASTLiteral>(s3_settings->auth_settings[S3AuthSetting::access_key_id].value));
arguments->children.push_back(make_intrusive<ASTLiteral>(s3_settings->auth_settings[S3AuthSetting::secret_access_key].value));
if (!s3_settings->auth_settings[S3AuthSetting::session_token].value.empty())
    arguments->children.push_back(make_intrusive<ASTLiteral>(s3_settings->auth_settings[S3AuthSetting::session_token].value));
```
Forward full S3 auth options in distributed rewrites
StorageObjectStorageCluster::updateQueryForDistributedEngineIfNeeded builds remote table-function calls from createArgsWithAccessData(), but this serializer only emits URL plus key/secret/session/NOSIGN fields. Table-level S3 auth passed via extra_credentials (for example role_arn/role_session_name) and related auth options are not forwarded, so queries against S3 engine tables that rely on those settings can fail on worker nodes after rewrite because they execute with different credentials.
```cpp
arguments->children.push_back(make_intrusive<ASTLiteral>(connection_params.endpoint.storage_account_url));
arguments->children.push_back(make_intrusive<ASTLiteral>(connection_params.endpoint.container_name));
arguments->children.push_back(make_intrusive<ASTLiteral>(blob_path.path));
if (account_name && account_key)
{
```
Preserve Azure auth context when serializing engine args
The rewritten Azure arguments only include storage_account_url, container, blob path, and optional account key pair, but omit other parsed auth state (notably SAS/workload-identity details). Since distributed engine reads are rewritten through this serializer, Azure tables configured with SAS URL auth or explicit workload identity credentials lose required auth data on remote nodes and can fail with authorization errors.
…_rest_warehouses 25.8 Antalya ports: Improvement for Iceberg REST catalog
…for_iceberg_timestamptz Timezone for iceberg timestamptz
…etrics More profile metrics for Iceberg, S3 and Azure
Documentation for Swarm features in Antalya branch
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Frontports for Antalya 26.1
CI/CD Options
Exclude tests:
Regression jobs to run: